STA4173 Lecture 3, Summer 2023
Independent: An individual selected for one sample does not dictate which individual is to be in a second sample.
e.g., there are two sections of STA4173; sampling from each would be independent samples
Dependent: An individual selected to be in one sample is used to determine the individual in the second sample.
e.g., sampling from one section of STA4173 and examining project grades over time would be dependent samples
Among competing acne medications, does one perform better than the other?
To answer this question, researchers applied Medication A to one part of the subject’s face and Medication B to a different part of the subject’s face to determine the proportion of subjects whose acne cleared up for each medication.
The part of the face that received Medication A was randomly determined.
Is this independent or dependent data?
This is dependent data – the data can be linked by person.
Do individuals who make fast-food purchases with a credit card tend to spend more than those who pay with cash?
To answer this question, a marketing manager randomly selects 30 credit-card receipts and 30 cash receipts to determine if the credit-card receipts have a significantly higher dollar amount, on average.
Is this independent or dependent data?
This is independent data – the data cannot be linked.
Are products purchased on Amazon less expensive than those purchased online at Walmart?
To answer this question, researchers randomly identified 20 products sold at both stores and determined the selling price at Amazon and the online Walmart store to determine if there was a significant difference in the price of the goods.
Is this independent or dependent data?
This is dependent data – the data can be linked by item.
We are now interested in comparing two independent groups.
We assume that the two groups come from different populations where
\mu_i is the mean for group i,
\sigma^2_i is the variance for group i, and
N_i is the population size of group i.
After drawing samples, we have the following,
\bar{x}_i estimates \mu_i,
s^2_i estimates \sigma^2_i, and
n_i is the sample size for group i.
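Because we are interested in comparing groups, we are interested in \mu_1-\mu_2. The (1-\alpha)100\% confidence interval for \mu_1-\mu_2 is
(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2 }{n_1} + \frac{s_2^2}{n_2}}
where t_{\alpha/2} has \text{min}(n_1-1, n_2-1) degrees of freedom. This is a conservative choice for hand calculation; R's t.test() instead uses the Welch approximation to the degrees of freedom, which is why its output reports a non-integer df.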
To construct this interval, we require either:
We will use the t.test() function to find the CI.
In the Spacelab Life Sciences 2 payload, 14 male rats were sent to space.
Upon their return, the red blood cell mass (in milliliters) of the rats was determined.
A control group of 14 male rats was held under the same conditions (except for space flight) as the space rats, and their red blood cell mass was also determined when the space rats returned.
The project resulted in the following data:
library(tidyverse)
rbc <- c(8.59, 8.64, 7.43, 7.21, 6.39, 6.87, 7.89,
9.79, 6.85, 7.54, 7.00, 8.80, 9.30, 8.03,
8.65, 6.99, 8.40, 9.66, 7.14, 7.62, 7.44,
8.55, 8.70, 9.14, 7.33, 8.58, 9.88, 9.94) # enter blood cell mass
rat <- c(rep("Space",14), rep("Earth",14)) # enter identifier
data <- tibble(rat, rbc) # create dataset
t.test(rbc ~ rat, data = data, conf.level = 0.99)
Welch Two Sample t-test
data: rbc by rat
t = 1.4368, df = 25.996, p-value = 0.1627
alternative hypothesis: true difference in means between group Earth and group Space is not equal to 0
99 percent confidence interval:
-0.5130364 1.6116078
sample estimates:
mean in group Earth mean in group Space
8.430000 7.880714
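The interval reported by t.test() can be reproduced by hand from the CI formula for \mu_1-\mu_2. The sketch below re-enters the rat data from this example and computes the 99% CI twice: once with the conservative \text{min}(n_1-1, n_2-1) degrees of freedom and once with the Welch approximation that t.test() uses by default. The object names (earth, space, se, df_welch, and so on) are illustrative choices, not part of the original slides.

```r
# Red blood cell mass (ml), re-entered from the example above
rbc <- c(8.59, 8.64, 7.43, 7.21, 6.39, 6.87, 7.89,
         9.79, 6.85, 7.54, 7.00, 8.80, 9.30, 8.03,
         8.65, 6.99, 8.40, 9.66, 7.14, 7.62, 7.44,
         8.55, 8.70, 9.14, 7.33, 8.58, 9.88, 9.94)
rat <- c(rep("Space", 14), rep("Earth", 14))

earth <- rbc[rat == "Earth"]
space <- rbc[rat == "Space"]
n1 <- length(earth)
n2 <- length(space)
v1 <- var(earth) / n1   # s_1^2 / n_1
v2 <- var(space) / n2   # s_2^2 / n_2

diff_means <- mean(earth) - mean(space)   # x-bar_1 - x-bar_2
se <- sqrt(v1 + v2)                       # standard error of the difference

# Two choices of degrees of freedom
df_cons <- min(n1 - 1, n2 - 1)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 99% CIs: alpha = 0.01, so the critical value is the 0.995 quantile
ci_cons <- diff_means + c(-1, 1) * qt(0.995, df_cons) * se
ci_welch <- diff_means + c(-1, 1) * qt(0.995, df_welch) * se

ci_cons
ci_welch
```

With these data the Welch interval should agree with the t.test() output above, and the conservative interval comes out somewhat wider because it uses fewer degrees of freedom.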
If the 99% CI for \mu_\text{Earth}-\mu_\text{Space} is (-0.51, 1.61),
Is there a difference in red blood cell mass between the space and earth rats?
Is the difference smaller than 2 ml?
We will use the t.test() function to perform the test.
Important!!
We are estimating \mu_1 - \mu_2, but R is going to subtract in alphabetical or numeric order of the grouping variable.
e.g., if we have “Earth” and “Space”, it will estimate \mu_{\text{Earth}} - \mu_{\text{Space}}.
e.g., if we have the numbers 5 and 110, it will estimate \mu_{5} - \mu_{110}; if the same labels are stored as text, “110” sorts before “5” and it will estimate \mu_{110} - \mu_{5}.
In the case of two-tailed tests, this does not matter… but beware when doing a one-tailed test!
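The subtraction order follows the factor levels of the grouping variable, so it can be checked before running the test. A minimal sketch, using made-up group labels only to show how R orders levels (the object names are illustrative):

```r
# Hypothetical labels; only the ordering of the levels matters here
lev_words <- levels(factor(c("Space", "Earth")))   # character labels sort alphabetically
lev_num   <- levels(factor(c(110, 5)))             # numeric values sort numerically
lev_text  <- levels(factor(c("110", "5")))         # numbers stored as text sort as text

lev_words
lev_num
lev_text

# To control the order, set the levels explicitly
g <- factor(c("Space", "Earth"), levels = c("Space", "Earth"))
levels(g)
```

With the levels set as in the last two lines, a formula call such as t.test(y ~ g) would estimate the Space mean minus the Earth mean, which matters for the direction of a one-tailed alternative.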
What is the continuous variable? rbc
What is the grouping variable? rat
Welch Two Sample t-test
data: rbc by rat
t = 1.4368, df = 25.996, p-value = 0.1627
alternative hypothesis: true difference in means between group Earth and group Space is not equal to 0
95 percent confidence interval:
-0.2365544 1.3351258
sample estimates:
mean in group Earth mean in group Space
8.430000 7.880714
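Hypotheses
H_0: \mu_\text{Earth} - \mu_\text{Space} = 0
H_1: \mu_\text{Earth} - \mu_\text{Space} \ne 0
Test Statistic and p-Value
t_0 = 1.437, p = 0.163
Rejection Region
Reject H_0 if p < \alpha.
Conclusion/Interpretation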
Fail to reject H_0.
There is not sufficient evidence to suggest that there is a difference in the red blood cell mass between earth and space rats.
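The two-sided p-value is the area in both tails of the t distribution beyond the observed test statistic. A minimal check, using the rounded t and df printed in the t.test() output above (the variable names are illustrative):

```r
# Values copied from the Welch two-sample output for the rat data
t0 <- 1.4368
df <- 25.996

# Two-sided p-value: twice the upper-tail area beyond |t0|
p_two <- 2 * pt(abs(t0), df = df, lower.tail = FALSE)
p_two
```

The result should match the reported p-value up to rounding, and it is well above any conventional \alpha, which is why we fail to reject H_0.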
One Sample t-test
data: data$worms
t = 10.582, df = 23, p-value = 2.6e-10
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
27.76022 38.48978
sample estimates:
mean of x
33.125
worms_trt <- data %>% filter(trt == "Treated") # only treated sheep
t.test(worms_trt$worms, conf.level = 0.90)[c("estimate", "conf.int")] # mean and 90% CI
$estimate
mean of x
26.58333
$conf.int
[1] 19.13771 34.02895
attr(,"conf.level")
[1] 0.9
worms_untrt <- data %>% filter(trt == "Not") # only untreated sheep
t.test(worms_untrt$worms, conf.level = 0.90)[c("estimate", "conf.int")] # mean and 90% CI
$estimate
mean of x
39.66667
$conf.int
[1] 32.48199 46.85134
attr(,"conf.level")
[1] 0.9
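Putting our descriptives together:
| | Mean (90% CI) |
|---|---|
| All lambs | 33.13 (27.76, 38.49) |
| Treated | 26.58 (19.14, 34.03) |
| Not treated | 39.67 (32.48, 46.85) |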
What do we hypothesize will happen when we look at inference on \mu_{\text{T}} - \mu_{\text{U}}?
Welch Two Sample t-test
data: worms by trt
t = 2.2709, df = 21.972, p-value = 0.03331
alternative hypothesis: true difference in means between group Not and group Treated is not equal to 0
90 percent confidence interval:
3.189613 22.977054
sample estimates:
mean in group Not mean in group Treated
39.66667 26.58333
90% CI for \mu_{\text{U}} - \mu_{\text{T}} is (3.19, 22.98).
Is there significant evidence that the untreated lambs have a mean tapeworm count that is more than five units greater than the mean count for the treated lambs?
Let’s now look at the formal hypothesis test.
Is there significant evidence that the untreated lambs have a mean tapeworm count that is more than five units greater than the mean count for the treated lambs?
Welch Two Sample t-test
data: worms by trt
t = 1.403, df = 21.972, p-value = 0.08729
alternative hypothesis: true difference in means between group Not and group Treated is greater than 5
90 percent confidence interval:
5.470851 Inf
sample estimates:
mean in group Not mean in group Treated
39.66667 26.58333
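Hypotheses
H_0: \mu_{\text{U}} - \mu_{\text{T}} \le 5
H_1: \mu_{\text{U}} - \mu_{\text{T}} > 5
Test Statistic and p-Value
t_0 = 1.403, p = 0.087
Rejection Region
Reject H_0 if p < \alpha; \alpha = 0.10.
Conclusion/Interpretation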
Reject H_0.
There is sufficient evidence to suggest that untreated lambs have a mean tapeworm count that is more than five units greater than the mean count for the treated lambs.
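Output of this form is what t.test() prints when it is called with mu = 5 and alternative = "greater". The sketch below shows where the new test statistic comes from, using only the rounded values printed in the two outputs above: the standard error is recovered from the test of a zero difference, and the hypothesized value 5 is then subtracted in the numerator. The variable names are illustrative.

```r
# Values copied from the outputs above
mean_untreated <- 39.66667
mean_treated   <- 26.58333
t_vs_0 <- 2.2709     # statistic from the two-sided test of difference = 0
df     <- 21.972     # Welch degrees of freedom

diff_means <- mean_untreated - mean_treated
se <- diff_means / t_vs_0            # standard error, since t = difference / SE

# Test statistic and one-sided p-value for H_1: difference > 5
t_vs_5 <- (diff_means - 5) / se
p_one  <- pt(t_vs_5, df = df, lower.tail = FALSE)

# One-sided 90% lower confidence bound
lower_90 <- diff_means - qt(0.90, df) * se

c(t = t_vs_5, p = p_one, lower = lower_90)
```

The one-sided bound uses the 0.90 quantile rather than the 0.95 quantile used by the two-sided 90% interval, which is why its lower limit sits above 5 while the two-sided interval still contains 5.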
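Let’s compare the two 90% CIs for \mu_{\text{U}} - \mu_{\text{T}}:
Two-sided: (3.19, 22.98)
One-sided: (5.47, \infty)
The two-sided interval contains 5, while the lower bound of the one-sided interval lies above 5.
Note that when we are “close to the boundary” for one-tailed hypothesis testing, our two-sided CIs may not match the hypothesis test results.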
We are now interested in comparing two dependent groups.
We assume that the two groups come from the same population and are going to examine the difference within each pair i,
d_i = y_{i, 1} - y_{i, 2}
After drawing samples, we have the following,
\bar{d} estimates \mu_d,
s^2_d estimates \sigma^2_d, and
n is the sample size.
\bar{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}}
where t_{\alpha/2} has n-1 degrees of freedom.
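Because the interval is built from the differences alone, a paired analysis is the same as a one-sample analysis of d. A minimal sketch with hypothetical measurements (the before and after values are made up for illustration and are not from the lecture):

```r
# Hypothetical paired measurements on six subjects
before <- c(12.1, 11.4, 13.0, 10.8, 12.6, 11.9)
after  <- c(11.5, 11.0, 12.1, 10.9, 12.0, 11.2)

d <- before - after   # one difference per pair
n <- length(d)

# 95% CI from the formula: d-bar +/- t_{alpha/2} * s_d / sqrt(n)
ci_hand <- mean(d) + c(-1, 1) * qt(0.975, n - 1) * sd(d) / sqrt(n)

# The same interval from t.test(), two equivalent ways
ci_paired    <- t.test(before, after, paired = TRUE)$conf.int
ci_onesample <- t.test(d)$conf.int

ci_hand
```

All three intervals should coincide, which is a quick way to confirm that a call really is treating the data as paired.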
To construct this interval, we require either:
We will use the t.test() function to find the CI.
Insurance adjusters are concerned about the high estimates they are receiving for auto repairs from garage I compared to garage II.
15 cars were taken to both garages for separate estimates of repair costs.
Paired t-test
data: garage$g1 and garage$g2
t = 6.0234, df = 14, p-value = 3.126e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.3949412 0.8317254
sample estimates:
mean difference
0.6133333
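The reported interval can be checked against the formula \bar{d} \pm t_{\alpha/2} s_d/\sqrt{n} using only the numbers printed above. In this sketch the standard error s_d/\sqrt{n} is recovered from the test statistic, since t = \bar{d} divided by the standard error; the variable names are illustrative.

```r
# Values copied from the paired t-test output
d_bar <- 0.6133333   # mean difference (garage I - garage II)
t0    <- 6.0234      # test statistic for mean difference = 0
n     <- 15          # number of cars

se <- d_bar / t0     # s_d / sqrt(n)

# 95% CI with n - 1 = 14 degrees of freedom
ci_95 <- d_bar + c(-1, 1) * qt(0.975, n - 1) * se
ci_95
```

Up to rounding in the printed inputs, this should reproduce the 95% interval in the output.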
From the problem statement:
Our CI is (0.39, 0.83) – can we say that estimates from garage I are higher than those from garage II?
We will use the t.test() function to perform the test.
Important!!
For dependent samples, we must specify paired = TRUE in t.test().
Paired t-test
data: garage$g1 and garage$g2
t = 6.0234, df = 14, p-value = 1.563e-05
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
0.4339886 Inf
sample estimates:
mean difference
0.6133333
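Hypotheses
H_0: \mu_d \le 0
H_1: \mu_d > 0, where d is the garage I estimate minus the garage II estimate
Test Statistic and p-Value
t_0 = 6.023, p < 0.001
Rejection Region
Reject H_0 if p < \alpha.
Conclusion/Interpretation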
Reject H_0.
There is sufficient evidence to suggest that the estimates at garage I are higher, on average, than those at garage II.
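Switching from a two-sided to a one-sided alternative changes the p-value and the interval but not the test statistic. A minimal sketch using the rounded values from the garage outputs above (the variable names are illustrative):

```r
# Values copied from the paired t-test outputs
t0    <- 6.0234
df    <- 14
d_bar <- 0.6133333

# Two-sided p-value uses both tails; the "greater" alternative uses the upper tail only
p_two     <- 2 * pt(abs(t0), df, lower.tail = FALSE)
p_greater <- pt(t0, df, lower.tail = FALSE)

# One-sided 95% lower bound uses the 0.95 quantile instead of the 0.975 quantile
se <- d_bar / t0
lower_95 <- d_bar - qt(0.95, df) * se

c(p_two = p_two, p_greater = p_greater, lower = lower_95)
```

Because the observed statistic is positive, the one-sided p-value is half the two-sided one, and the one-sided lower bound is higher than the lower limit of the two-sided interval.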
Professor Neill measured the time (in seconds) required to catch a falling meter stick for 12 randomly selected students’ dominant and nondominant hands.
A coin flip is used to determine whether reaction time is measured using the dominant or nondominant hand first.
Professor Neill wants to know if the reaction time in an individual’s dominant hand is equal to the reaction time in their nondominant hand.
First, find the 99% CI for the difference in reaction times.
Then, formally test to determine if there is a difference; test at the \alpha=0.01 level.
Let’s first describe the data.
Our first step will be to find the difference so that we can look at \bar{d} and s_d.
students <- students %>% mutate(d = dom - non) # difference: dominant - non-dominant
students %>% summarize(mean(dom), sd(dom), # summary of dominant hand
mean(non), sd(non), # summary of non-dominant hand
mean(d), sd(d)) # summary of difference
| | Mean (Std. Dev) |
|---|---|
| Dominant Hand | 0.180 (0.018) |
| Non-dominant Hand | 0.193 (0.018) |
| Difference (Dominant - Non-dominant) | -0.013 (0.016) |
Paired t-test
data: students$dom and students$non
t = -2.7759, df = 11, p-value = 0.01803
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
-0.02789797 0.00156464
sample estimates:
mean difference
-0.01316667
The 99% CI for \mu_d, where d = x_{\text{dominant}} - x_{\text{non-dominant}} is (-0.028, 0.002).
Can we say that there is a difference in the reaction times?
Paired t-test
data: students$dom and students$non
t = -2.7759, df = 11, p-value = 0.01803
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.023606270 -0.002727063
sample estimates:
mean difference
-0.01316667
Hypotheses
H_0: \mu_d = 0
H_1: \mu_d \ne 0
Test Statistic and p-Value
t_0 = -2.776, p = 0.018
Rejection Region
Reject H_0 if p < \alpha; \alpha = 0.01.
Conclusion/Interpretation
Fail to reject H_0.
There is not sufficient evidence to suggest that there is a difference in reaction times between dominant and non-dominant hands.
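The decision and the 99% interval tell the same story, which can be verified from the printed output alone. In this sketch the inputs are the rounded values from the paired t-test output, and the variable names are illustrative.

```r
# Values copied from the paired t-test output for the reaction times
d_bar <- -0.01316667   # mean difference (dominant - non-dominant)
t0    <- -2.7759
df    <- 11
alpha <- 0.01

# Two-sided p-value and the decision at alpha = 0.01
p_two  <- 2 * pt(abs(t0), df, lower.tail = FALSE)
reject <- p_two < alpha

# 99% CI rebuilt from the standard error implied by t0
se    <- d_bar / t0
ci_99 <- d_bar + c(-1, 1) * qt(1 - alpha / 2, df) * se

p_two
reject
ci_99
```

The 99% interval contains 0, in agreement with failing to reject H_0 at \alpha = 0.01, while the 95% interval in the last output excludes 0, in agreement with p < 0.05.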